In [1]:
import pandas as pd
import plotly
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
pd.set_option('display.max_columns', 110)
pd.set_option('display.max_rows', 200)
import warnings
warnings.filterwarnings('ignore')
import plotly.express as px

Reading CSV

In [2]:
listing = pd.read_csv('CSV/listings.csv')
reviews = pd.read_csv('CSV/reviews.csv')
In [3]:
listing.head()
Out[3]:
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview notes transit access interaction house_rules thumbnail_url medium_url picture_url xl_picture_url host_id host_url host_name host_since host_location host_about host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified street neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed city state zipcode market smart_location country_code country latitude longitude is_location_exact property_type room_type accommodates bathrooms bedrooms beds bed_type amenities square_feet price weekly_price monthly_price security_deposit cleaning_fee guests_included extra_people minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm calendar_updated has_availability availability_30 availability_60 availability_90 availability_365 calendar_last_scraped number_of_reviews number_of_reviews_ltm first_review last_review review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value requires_license license jurisdiction_names instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 17878 https://www.airbnb.com/rooms/17878 20200121213543 2020-01-22 Very Nice 2Br - Copacabana - WiFi Pls note that special rates apply for Carnival... - large balcony which looks out on pedestrian ... Pls note that special rates apply for Carnival... none This is the best spot in Rio. Everything happe... NaN Excellent location. Close to all major public ... The entire apartment is yours. It's like your ... I will be available throughout your stay shoul... Please leave the apartment in a clean fashion ... NaN NaN https://a0.muscache.com/im/pictures/65320518/3... NaN 68997 https://www.airbnb.com/users/show/68997 Matthias 2010-01-08 Rio de Janeiro, State of Rio de Janeiro, Brazil I am a journalist/writer. Lived for 15 years... within an hour 100% NaN t https://a0.muscache.com/im/pictures/user/67b13... https://a0.muscache.com/im/pictures/user/67b13... Copacabana 2.0 2.0 ['email', 'phone', 'reviews', 'jumio', 'offlin... t t Rio de Janeiro, Rio de Janeiro, Brazil Copacabana Copacabana NaN Rio de Janeiro Rio de Janeiro 22020-050 Rio De Janeiro Rio de Janeiro, Brazil BR Brazil -22.96592 -43.17896 t Condominium Entire home/apt 5 1.0 2.0 2.0 Real Bed {TV,"Cable TV",Internet,Wifi,"Air conditioning... NaN $332.00 NaN NaN $0.00 $378.00 2 $63.00 5 30 5 5 1125 1125 5.0 1125.0 6 weeks ago t 1 7 37 312 2020-01-22 246 26 2010-07-15 2019-12-22 93.0 10.0 10.0 10.0 10.0 10.0 9.0 f NaN NaN t f strict_14_with_grace_period f f 1 1 0 0 2.12
1 21280 https://www.airbnb.com/rooms/21280 20200121213543 2020-01-22 Renovated Modern Apt. Near Beach Immaculately renovated top-floor apartment ove... Immaculately renovated top-floor apartment in ... Immaculately renovated top-floor apartment ove... none This is the best neighborhood in Zona Sul. Fo... NaN The new metro station is just a few steps away... This is an older "Art Deco" style building, so... Someone will be there at check in and check ou... This is a booking agreement for rental of a tw... NaN NaN https://a0.muscache.com/im/pictures/60851312/b... NaN 81163 https://www.airbnb.com/users/show/81163 Jules 2010-02-14 Chicago, Illinois, United States Hi I am Jules and I have a beautiful apartment... within an hour 100% NaN f https://a0.muscache.com/im/users/81163/profile... https://a0.muscache.com/im/users/81163/profile... Ipanema 0.0 0.0 ['email', 'phone', 'reviews', 'kba'] t t Rio de Janeiro, RJ, Brazil Ipanema Ipanema NaN Rio de Janeiro RJ 22420-010 Rio De Janeiro Rio de Janeiro, Brazil BR Brazil -22.98467 -43.19611 t Apartment Entire home/apt 6 2.0 2.0 4.0 Real Bed {TV,"Cable TV",Internet,Wifi,"Air conditioning... NaN $336.00 $3,920.00 $13,836.00 $2,098.00 $210.00 6 $0.00 5 30 5 5 30 30 5.0 30.0 3 weeks ago t 6 12 12 12 2020-01-22 89 1 2014-02-14 2020-01-04 97.0 10.0 10.0 10.0 10.0 10.0 10.0 f NaN NaN f f strict_14_with_grace_period f f 1 1 0 0 1.23
2 25026 https://www.airbnb.com/rooms/25026 20200121213543 2020-01-22 Beautiful Modern Decorated Studio in Copa Our apartment is a little gem, everyone loves ... This newly renovated studio (last renovations ... Our apartment is a little gem, everyone loves ... none Copacabana is a lively neighborhood and the ap... For any stay superior to 15 days, an additiona... At night we recommend you to take taxis only. ... internet wi-fi, cable tv, air cond, ceiling fa... Only at check in, we like to leave our guests ... Smoking outside only. Family building so pleas... NaN NaN https://a0.muscache.com/im/pictures/3003965/68... NaN 102840 https://www.airbnb.com/users/show/102840 Viviane 2010-04-03 Rio de Janeiro, State of Rio de Janeiro, Brazil Hi guys, We're a lovely team of 3 people:\r\n\... within a day 86% NaN f https://a0.muscache.com/im/pictures/user/9e204... https://a0.muscache.com/im/pictures/user/9e204... Copacabana 3.0 3.0 ['email', 'phone', 'facebook', 'reviews', 'jum... t t Rio de Janeiro, Rio de Janeiro, Brazil Copacabana Copacabana NaN Rio de Janeiro Rio de Janeiro 22060-020 Rio De Janeiro Rio de Janeiro, Brazil BR Brazil -22.97712 -43.19045 t Apartment Entire home/apt 2 1.0 1.0 2.0 Real Bed {TV,"Cable TV",Internet,Wifi,"Air conditioning... NaN $159.00 NaN NaN $1,000.00 $250.00 2 $45.00 7 60 7 7 60 60 7.0 60.0 3 days ago t 13 16 16 21 2020-01-22 237 15 2010-06-07 2019-12-18 94.0 9.0 10.0 9.0 10.0 10.0 9.0 f NaN NaN f f strict_14_with_grace_period t t 3 3 0 0 2.02
3 31560 https://www.airbnb.com/rooms/31560 20200121213543 2020-01-22 NICE & COZY 1BDR - IPANEMA BEACH This nice and clean 1 bedroom apartment is loc... This nice and clean 1 bedroom apartment is loc... This nice and clean 1 bedroom apartment is loc... none Die Nachbarschaft von Ipanema ist super lebend... NaN Bus, U-Bahn, Taxi und Leihfahrräder in der Nähe. Die Urlauber dürfen das Badezimmer benutzen, d... NaN So far, I haven't had any problems with guests... NaN NaN https://a0.muscache.com/im/pictures/83114449/2... NaN 135635 https://www.airbnb.com/users/show/135635 Renata 2010-05-31 Rio de Janeiro, Rio de Janeiro, Brazil I was born and raised in Rio de (Website hidde... within an hour 100% NaN t https://a0.muscache.com/im/users/135635/profil... https://a0.muscache.com/im/users/135635/profil... Ipanema 1.0 1.0 ['email', 'phone', 'manual_online', 'facebook'... t t Rio de Janeiro, RJ, Brazil Ipanema Ipanema NaN Rio de Janeiro RJ 22410-003 Rio De Janeiro Rio de Janeiro, Brazil BR Brazil -22.98302 -43.21427 t Apartment Entire home/apt 3 1.0 1.0 2.0 Real Bed {TV,"Cable TV",Internet,Wifi,"Air conditioning... NaN $273.00 NaN NaN $0.00 $84.00 2 $42.00 2 1125 2 5 1125 1125 2.0 1125.0 2 weeks ago t 0 12 40 130 2020-01-22 277 39 2010-07-11 2020-01-20 96.0 10.0 10.0 10.0 10.0 10.0 10.0 f NaN NaN t f strict_14_with_grace_period f f 1 1 0 0 2.39
4 35636 https://www.airbnb.com/rooms/35636 20200121213543 2020-01-22 Cosy flat close to Ipanema beach This cosy apartment is just a few steps away ... The location is extremely convenient, safe and... This cosy apartment is just a few steps away ... none The apartment street is very quiet and safe .... Please include the following information with ... Metro stop just 5 blocks from our place. Buses... NaN NaN Dear Guest, Welcome! We hope you enjoy our apa... NaN NaN https://a0.muscache.com/im/pictures/20009355/3... NaN 153232 https://www.airbnb.com/users/show/153232 Patricia 2010-06-27 San Carlos de Bariloche, Rio Negro, Argentina I am Brazilian and Carioca graphic designer, b... within an hour 100% NaN f https://a0.muscache.com/im/users/153232/profil... https://a0.muscache.com/im/users/153232/profil... Ipanema 1.0 1.0 ['email', 'phone', 'facebook', 'reviews', 'man... t t Rio de Janeiro, Rio de Janeiro, Brazil Ipanema Ipanema NaN Rio de Janeiro Rio de Janeiro 22081-020 Rio De Janeiro Rio de Janeiro, Brazil BR Brazil -22.98816 -43.19359 t Apartment Entire home/apt 2 1.5 1.0 1.0 Real Bed {TV,"Cable TV",Internet,Wifi,"Air conditioning... NaN $378.00 NaN $9,652.00 $1,049.00 $172.00 2 $63.00 2 89 2 2 89 89 2.0 89.0 4 months ago t 2 13 24 108 2020-01-22 174 35 2013-10-22 2020-01-03 94.0 10.0 9.0 10.0 10.0 10.0 9.0 f NaN NaN f f strict_14_with_grace_period t t 1 1 0 0 2.29

Part I: How does price change with location?

  • Price preprocessing
In [4]:
# Cast the price column to str so that '$' and ',' can be stripped before converting to float
listing['price'] = listing['price'].astype(str)
In [5]:
def func_remove_character(x, character):
    """ Returns string with the character removed
    
    Args:
        x (str): The string that will have the character removed
        character (str): character to be removed from x
    
    Returns:
        String with the symbol removed
    
    """
    return x.replace(character, '')
In [6]:
listing['price'] = listing.price.apply(lambda x: func_remove_character(str(x), '$'))
listing['price'] = listing.price.apply(lambda x: func_remove_character(str(x), ','))
In [7]:
listing['price'] = listing['price'].astype(float)
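
The same cleaning can also be written with vectorized string methods instead of apply. This is a minimal sketch on a hypothetical sample Series (raw_price is not part of the notebook), assuming the raw values look like the ones shown in the listing preview:

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'price' strings in listings.csv
raw_price = pd.Series(['$332.00', '$3,920.00', '$0.00', None])

# Remove '$' and ',' as literal characters (regex=False), then convert to numbers.
# Missing values stay missing (NaN) after the conversion.
clean_price = pd.to_numeric(
    raw_price.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
)
print(clean_price.tolist())
```

Passing regex=False matters here because '$' is a special character in regular expressions.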
  • Analyzing Price distribution
In [8]:
listing['price'].describe() # 75% of prices are at or below 600.
Out[8]:
count    34754.000000
mean       645.688180
std       1674.245213
min          0.000000
25%        155.000000
50%        298.000000
75%        600.000000
max      41966.000000
Name: price, dtype: float64
In [9]:
sns.violinplot(x=listing.loc[listing['price'] < 1000, 'price']);

Some listings have a price of zero, which is clearly a mistake. These rows are dropped from the dataset, since there are only a few of them.

In [10]:
listing.loc[listing['price'] <= 0, 'price']
Out[10]:
19690    0.0
20191    0.0
20207    0.0
20208    0.0
20213    0.0
20214    0.0
20234    0.0
Name: price, dtype: float64
In [11]:
listing.drop(listing.loc[listing['price'] <= 0].index, axis = 0, inplace = True)

Since price is heavily skewed, a binning technique is applied to make the data easier to analyze. The bin edges below were chosen to keep a similar number of listings in each class:

  • 0: [0, 100]
  • 1: (100,150]
  • 2: (150, 200]
  • 3: (200, 250]
  • 4: (250, 300]
  • 5: (300, 400]
  • 6: (400, 600]
  • 7: (600, 1000]
  • 8: (1000, 100000] # the maximum price of the dataset is 41966.0
In [12]:
bins_price = [0, 100, 150, 200, 250, 300, 400, 600, 1000, 100000] 
price_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8]
In [13]:
listing['price_bins'] = pd.cut(listing['price'], bins_price, labels = price_labels, include_lowest = True)
In [14]:
listing['price_bins'] = listing['price_bins'].astype(int)
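
As a point of comparison, hand-picked edges (pd.cut) can be contrasted with quantile-based edges (pd.qcut), which balance the classes automatically. This is only a sketch on a hypothetical list of prices; the notebook itself uses the fixed edges above.

```python
import pandas as pd

# Hypothetical prices, not taken from the dataset
prices = pd.Series([50, 120, 180, 240, 290, 350, 500, 800, 5000, 90, 160, 420])

# Fixed edges, as in the notebook; labels=False returns the integer bin index
edges = [0, 100, 150, 200, 250, 300, 400, 600, 1000, 100000]
fixed = pd.cut(prices, edges, labels=False, include_lowest=True)

# Quantile-based edges: each of the 4 bins receives the same number of samples
quant = pd.qcut(prices, q=4, labels=False)

print(fixed.tolist())
print(quant.value_counts().sort_index())
```

Fixed edges keep the classes interpretable (for example, "up to 100"), while quantile edges guarantee balance but move whenever the data changes.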
In [15]:
listing['price_bins'].hist(bins = 30);

Price VS Location

Analyzing different values for host_neighbourhood

In [16]:
listing['host_neighbourhood'].value_counts()
Out[16]:
Copacabana                           7050
Ipanema                              2623
Barra da Tijuca                      2567
Botafogo                             1162
Leblon                               1122
Recreio dos Bandeirantes              967
Flamengo                              796
Santa Teresa                          735
Lapa                                  559
Laranjeiras                           480
Leme                                  476
Tijuca                                460
Glória                                246
Lagoa                                 238
Catete                                225
Humaitá                               209
Centro                                208
Jardim Botânico                       189
Vila Isabel                           185
Gávea                                 185
Maracanã                              161
São Conrado                           156
Vidigal                               149
Urca                                  102
São Cristóvão                          95
Rio Comprido                           68
Cosme Velho                            66
Estacio                                61
Grajaú                                 56
Praça da Bandeira                      53
Engenho de Dentro                      49
Joá                                    44
Andaraí                                41
Rocha                                  40
Méier                                  40
Del Castilho                           33
Barra de Guaratiba                     33
Engenho Novo                           31
Irajá                                  26
Todos os Santos                        25
Gamboa                                 22
Brás de Pina                           21
Penha                                  17
Lins de Vasconcelos                    17
Bonsucesso                             17
Cachambi                               14
Riachuelo                              13
Saúde                                  12
Sampaio                                11
Santo Cristo                           11
Encantado                              11
Rocinha                                11
Cidade Nova                            11
Consolacao                             10
Guadalupe                              10
Marechal Hermes                        10
Quintino Bocaiúva                      10
Cascadura                               9
Piedade                                 9
Maria da Graça                          9
Ramos                                   7
Catumbi                                 7
Bento Ribeiro                           7
Benfica                                 6
Madureira                               6
Complexo da Maré                        6
Inhaúma                                 6
Engenho da Rainha                       6
Barra                                   6
Pavuna                                  6
Islington                               5
Vila da Penha                           5
Parque Anchieta                         5
Barros Filho                            5
Upper East Side                         4
Chácara Inglesa                         4
Parada de Lucas                         4
Prati                                   4
Olaria                                  4
Penha Circular                          4
Vincente de Carvalho                    4
Pilares                                 4
South Beach                             3
Colégio                                 3
Jardim Paulista                         3
Bela Vista                              3
Cerqueira César                         3
Rocha Miranda                           3
Pinheiros                               3
Paraíso                                 3
Ondina                                  3
Retiro                                  3
Vila Olímpia                            2
Opéra - Grands Boulevards               2
Montmartre                              2
São João de Deus                        2
LB of Brent                             2
Châtelet - Les Halles - Beaubourg       2
Cordovil                                2
Astoria                                 2
Deodoro                                 2
Vila Mariana                            2
Engenheiro Leal                         2
Abolição                                2
Vila Kosmos                             2
Jardin Botânico                         2
East Village                            2
République                              2
Prenzlauer Berg                         1
Anchieta                                1
Manly                                   1
Sion                                    1
Alto do Pina                            1
Vila Clementino                         1
Nine Elms                               1
Greenpoint                              1
Hammersmith                             1
Serra                                   1
Tomás Coelho                            1
Jardim América                          1
Soho                                    1
Higienópolis                            1
Graça                                   1
Trastevere                              1
Almagro                                 1
Commerce - Dupleix                      1
Miami Beach                             1
Parione                                 1
Indische Buurt                          1
Campo Marzo                             1
II Arrondissement                       1
Le Marais                               1
Morumbi                                 1
Holland Park                            1
El Raval                                1
Silom                                   1
San Nicolás                             1
Pacific Beach                           1
Barra de Maricá                         1
Praia de Iracema                        1
Jacaré                                  1
Oswaldo Cruz                            1
Oosterparkbuurt                         1
San Telmo                               1
Nordeste                                1
Vigário Geral                           1
Cavalcante                              1
Kilmainham                              1
Monaco                                  1
Financial District                      1
Stanmore                                1
Coração de Jesus                        1
Caminho das Árvores                     1
Mile End                                1
Santo Antonio                           1
Bethesda, MD                            1
Caju                                    1
Alphabet City                           1
Vila Madalena                           1
Comercio                                1
Honório Gurgel                          1
Campos Elíseos                          1
Williamsburg                            1
Name: host_neighbourhood, dtype: int64

To analyze price across different neighbourhoods, the price_bins column was used instead of the price column: since listings are spread fairly evenly across its classes, the mean is less sensitive to outliers.

In [17]:
listing.groupby('host_neighbourhood').agg({'price_bins': 'mean'}).sort_values(by = 'price_bins')
Out[17]:
price_bins
host_neighbourhood
Almagro 0.000000
Vila Kosmos 0.000000
Honório Gurgel 0.000000
Vigário Geral 0.000000
Jacaré 0.000000
Jardim América 0.000000
Kilmainham 0.000000
Sion 0.000000
Oswaldo Cruz 0.000000
Bonsucesso 0.294118
Colégio 0.333333
Barros Filho 0.400000
Abolição 0.500000
LB of Brent 0.500000
Penha Circular 0.500000
Parada de Lucas 0.750000
Brás de Pina 0.761905
Marechal Hermes 0.800000
El Raval 1.000000
Caju 1.000000
Alto do Pina 1.000000
Trastevere 1.000000
Holland Park 1.000000
Sampaio 1.545455
Penha 1.647059
Olaria 1.750000
Estacio 1.754098
Centro 1.759615
Lins de Vasconcelos 1.764706
Inhaúma 1.833333
Todos os Santos 1.960000
Cordovil 2.000000
Anchieta 2.000000
Tomás Coelho 2.000000
Grajaú 2.017857
Encantado 2.090909
Catumbi 2.142857
Lapa 2.193202
Engenho Novo 2.193548
São Cristóvão 2.252632
Cachambi 2.285714
Cascadura 2.444444
Santo Cristo 2.454545
Irajá 2.461538
Del Castilho 2.484848
Vila Olímpia 2.500000
Saúde 2.500000
Gamboa 2.545455
Flamengo 2.640704
Retiro 2.666667
Bento Ribeiro 2.714286
Santa Teresa 2.757823
Maria da Graça 2.777778
Parque Anchieta 2.800000
Praça da Bandeira 2.811321
Engenho da Rainha 2.833333
Madureira 2.833333
Complexo da Maré 2.833333
Glória 2.926829
Rio Comprido 2.955882
Catete 2.960000
Pilares 3.000000
Montmartre 3.000000
Financial District 3.000000
Cidade Nova 3.000000
Manly 3.000000
Santo Antonio 3.000000
Le Marais 3.000000
Vila Clementino 3.000000
Williamsburg 3.000000
Piedade 3.111111
Méier 3.125000
Quintino Bocaiúva 3.200000
Botafogo 3.203098
Barra de Guaratiba 3.272727
Vidigal 3.288591
Tijuca 3.295652
Rocha Miranda 3.333333
Laranjeiras 3.339583
Cosme Velho 3.348485
Guadalupe 3.400000
Ramos 3.428571
Rocinha 3.454545
Châtelet - Les Halles - Beaubourg 3.500000
Deodoro 3.500000
Rocha 3.500000
Riachuelo 3.615385
Paraíso 3.666667
Cerqueira César 3.666667
Vila Isabel 3.681081
Maracanã 3.689441
Islington 3.800000
Vila da Penha 3.800000
Copacabana 3.809220
Barra 3.833333
Humaitá 3.966507
Prati 4.000000
Praia de Iracema 4.000000
Oosterparkbuurt 4.000000
Higienópolis 4.000000
République 4.000000
San Telmo 4.000000
Caminho das Árvores 4.000000
Alphabet City 4.000000
Coração de Jesus 4.000000
Consolacao 4.000000
Urca 4.009804
Leme 4.037815
Andaraí 4.121951
Engenho de Dentro 4.163265
Recreio dos Bandeirantes 4.243020
Chácara Inglesa 4.250000
Benfica 4.333333
Gávea 4.421622
Jardim Botânico 4.518519
São Conrado 4.538462
Ondina 4.666667
Pavuna 4.833333
Ipanema 4.937857
Lagoa 4.966387
Barra da Tijuca 4.973899
Barra de Maricá 5.000000
Mile End 5.000000
Morumbi 5.000000
Serra 5.000000
Bethesda, MD 5.000000
II Arrondissement 5.000000
Vila Mariana 5.000000
Campos Elíseos 5.000000
Cavalcante 5.000000
Miami Beach 5.000000
Leblon 5.128342
Joá 5.204545
South Beach 5.333333
São João de Deus 5.500000
Jardim Paulista 5.666667
Bela Vista 5.666667
Upper East Side 5.750000
Stanmore 6.000000
Astoria 6.000000
Vincente de Carvalho 6.000000
Pinheiros 6.000000
Vila Madalena 6.000000
Engenheiro Leal 6.000000
Nordeste 6.000000
Pacific Beach 6.000000
Parione 6.000000
Opéra - Grands Boulevards 6.500000
Comercio 7.000000
Jardin Botânico 7.000000
San Nicolás 7.000000
Monaco 7.000000
Nine Elms 7.000000
Graça 7.000000
Prenzlauer Berg 7.000000
Hammersmith 7.000000
Commerce - Dupleix 7.000000
East Village 7.000000
Silom 8.000000
Indische Buurt 8.000000
Soho 8.000000
Campo Marzo 8.000000
Greenpoint 8.000000

There are 163 different neighbourhoods, which makes it difficult to analyze the price_bins mean for every single one. In addition, some neighbourhoods contain only one sample, which can give a misleading picture of the average price for that location (clustering some locations could be one solution, but it isn't the best approach).
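
One simple way to guard against single-sample neighbourhoods is to report the sample count next to the mean and keep only groups above a minimum size. The sketch below uses a hypothetical toy DataFrame and a hypothetical threshold (min_listings); neither comes from the notebook.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook
df = pd.DataFrame({
    'host_neighbourhood': ['Copacabana'] * 3 + ['Ipanema'] * 2 + ['Soho'],
    'price_bins': [3, 4, 5, 5, 6, 8],
})

min_listings = 2  # hypothetical threshold, to be tuned for the real data

# Mean price bin and number of listings per neighbourhood
summary = df.groupby('host_neighbourhood')['price_bins'].agg(['mean', 'count'])

# Keep only neighbourhoods with enough listings for the mean to be meaningful
reliable = summary[summary['count'] >= min_listings].sort_values('mean')
print(reliable)
```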

One way around this problem is a visual analysis.

In [18]:
# To use the function below, you need a public token for the Mapbox API. See https://www.mapbox.com
px.set_mapbox_access_token('pk.eyJ1IjoiZGFuaWVsZGFjb3N0YSIsImEiOiJjazZzMGZ0c3gwYncwM2tzNW51d3B2ajUyIn0.U6j8vTW4kIJal4aBWEyDtQ')
fig = px.scatter_mapbox(listing, lat='latitude', lon='longitude', color='price_bins', size_max=20, zoom=9)
fig.show()
In [19]:
# Screenshot of the map above, so that it can be viewed on GitHub:
In [20]:
from IPython.display import Image
Image(filename='Images/price_location.PNG')
Out[20]:

From the image above, it is possible to observe that the price tends to be higher for Airbnbs close to the beach. For those more familiar with Rio de Janeiro's geography: the price is also higher, on average, in the most famous neighbourhoods: Leblon, Ipanema, Lagoa and Barra da Tijuca.

Part II: Does the host response rate affect the host's review score?

  • Preprocessing host_response_rate & review_scores_rating

Checking for null values.

In [21]:
listing['host_response_rate'].isnull().sum()
Out[21]:
11907

The host_response_rate column has 11907 null values, and there is no reliable way to fill such a large gap. The analysis is therefore done on the remaining rows. This is not ideal, but it is still reasonable, since 16434 samples remain.

In [22]:
# Creating a new dataset with the non null values of host_response_rate and review_scores_rating
response_and_score = listing.loc[(~listing.host_response_rate.isna()) & (~listing.review_scores_rating.isna())].copy()  # .copy() so later column assignments do not act on a view of listing
In [23]:
func_clean_response = lambda x: int(str(x).replace('%', ''))
In [24]:
# removing character '%' from host_response_rate
response_and_score['host_response_rate'] = response_and_score.host_response_rate.apply(lambda x:  func_remove_character(str(x), '%'))
response_and_score['host_response_rate'] = response_and_score['host_response_rate'].astype(int)
In [25]:
response_and_score.review_scores_rating.hist(bins = 30);

As observed above, review_scores_rating has a very imbalanced distribution, which can lead to a misleading analysis. Just as was done for the price column, a binning technique is applied to this column:

  • 0: [0, 96]
  • 1: (96, 100]
In [26]:
score_bins = [0, 96, 100]
score_labels = [0, 1]
In [27]:
response_and_score['review_scores_rating_bins'] =pd.cut(response_and_score['review_scores_rating'], score_bins, labels = score_labels, include_lowest = True)
In [28]:
response_and_score['review_scores_rating_bins'] = response_and_score['review_scores_rating_bins'].astype(int)
In [29]:
response_and_score['review_scores_rating_bins'].hist();
In [30]:
response_and_score['host_response_rate'].hist(bins = 30);

The same technique is applied to host_response_rate, in this case simply to make the analysis easier:

  • 0: [0, 10]
  • 1: (10, 20]
  • 2: (20, 30]
  • 3: (30, 40]
  • 4: (40, 50]
  • 5: (50, 60]
  • 6: (60, 70]
  • 7: (70, 80]
  • 8: (80, 90]
  • 9: (90, 100]
In [31]:
response_rate_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90,100]
response_rate_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In [32]:
response_and_score['host_response_rate_bins'] = pd.cut(response_and_score['host_response_rate'], response_rate_bins, labels = response_rate_labels, include_lowest = True)
In [33]:
response_and_score.groupby('host_response_rate_bins').agg({'review_scores_rating_bins':'mean'}).reset_index()
Out[33]:
host_response_rate_bins review_scores_rating_bins
0 0 0.599369
1 1 0.595238
2 2 0.578947
3 3 0.614130
4 4 0.560322
5 5 0.578014
6 6 0.542636
7 7 0.528649
8 8 0.574247
9 9 0.594645

The result shows no clear relationship between the host response rate and the review score: the share of highly rated listings stays between roughly 0.53 and 0.61 across all response-rate bins. This goes against the premise that the two variables would be directly proportional. The conclusion is not absurd, since not everyone cares about the host response rate, and other factors likely matter more for the review score: location, price, hospitality, cleanliness, etc.
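
When reading a table of per-bin means like the one above, it helps to see how many samples fall in each bin, since a mean over a handful of rows is much noisier than one over thousands. A minimal sketch, using hypothetical toy data with the notebook's column names (the output names share_high and n are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data; the real frame is response_and_score
demo = pd.DataFrame({
    'host_response_rate_bins': [9, 9, 9, 9, 0, 0, 5],
    'review_scores_rating_bins': [1, 0, 1, 1, 1, 0, 1],
})

# Share of highly rated listings and sample size for each response-rate bin
per_bin = demo.groupby('host_response_rate_bins').agg(
    share_high=('review_scores_rating_bins', 'mean'),
    n=('review_scores_rating_bins', 'count'),
)
print(per_bin)
```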

Part III: Airbnb growth through the years

Of the available datasets, reviews.csv has the best features for retrieving this information, since it contains the date of each review. This analysis is therefore based on the number of reviews per month, under the assumption that most guests leave a review after a stay.

A quick search suggests that approximately 60%-70% of guests leave a review.

In [34]:
# Checking for null values
reviews.isnull().sum()
Out[34]:
listing_id        0
id                0
date              0
reviewer_id       0
reviewer_name     0
comments         76
dtype: int64

Preprocessing the date column and grouping reviews by month. A monthly window was chosen for the analysis.

In [35]:
# Keep only the year and month ('YYYY-MM') of each date string
func_datetime = lambda x: str(x)[:7]
In [36]:
reviews['date'] = reviews.date.apply(func_datetime)
In [37]:
reviews['date'] = pd.to_datetime(reviews['date'])
In [38]:
reviews['month'] = reviews['date'].dt.month
In [39]:
reviews['year'] = reviews['date'].dt.year
In [40]:
reviews = reviews.sort_values(by = 'date').reset_index(drop = True)
In [41]:
reviews_time_series = reviews.groupby(['date', 'month', 'year']).agg({'listing_id': 'count'}).reset_index()
In [42]:
reviews_time_series.rename(columns = {'listing_id': 'total_reviews'}, inplace = True)
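
The monthly grouping can also be done without slicing strings, by parsing the dates first and then truncating them to the start of the month. This is a sketch on a hypothetical frame (reviews_demo), not a replacement for the cells above:

```python
import pandas as pd

# Hypothetical sample of review dates
reviews_demo = pd.DataFrame(
    {'date': ['2010-06-15', '2010-07-02', '2010-07-28', '2010-09-01']}
)
reviews_demo['date'] = pd.to_datetime(reviews_demo['date'])

# Map every date to the first day of its month
month_start = reviews_demo['date'].dt.to_period('M').dt.to_timestamp()

# Count reviews per month, ordered in time
monthly = month_start.value_counts().sort_index().rename('total_reviews')
print(monthly)
```

Note that a month with no reviews (August 2010 in this toy example) does not appear in the result at all, which is why checking that every month is present is a useful step.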
In [43]:
# Checking whether all months are present in every year
reviews_time_series.groupby('year').month.count()
Out[43]:
year
2010     7
2011    12
2012    12
2013    12
2014    12
2015    12
2016    12
2017    12
2018    12
2019    12
2020     1
Name: month, dtype: int64
In [44]:
reviews_time_series.set_index('date', inplace = True)
In [45]:
reviews_time_series.head()
Out[45]:
month year total_reviews
date
2010-06-01 6 2010 1
2010-07-01 7 2010 4
2010-08-01 8 2010 4
2010-09-01 9 2010 7
2010-10-01 10 2010 5

Plotting time series. Total_reviews vs date(month)

In [46]:
reviews_time_series.plot(y = 'total_reviews', linewidth = 5, fontsize = 20, figsize = (20,10));
plt.ylabel('Number of reviews', fontsize = 20);
plt.xlabel('Year', fontsize = 20);

The time series above shows a clear trend. In the following graph the trend is extracted by taking a rolling average. With the default settings of pandas, rolling(12) averages each point with the 11 months preceding it (a trailing window, not one centered on the point). A 12-month window was used so that the yearly seasonality is averaged out and only the trend remains.

In [47]:
# Checking for the existence of a trend
reviews_time_series.total_reviews.rolling(12).mean().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
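
The difference between a trailing and a centered rolling window is easy to see on a tiny series. The sketch below uses a window of 3 on hypothetical values; the notebook's rolling(12) call behaves like the trailing version.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: trailing window, the current point and the 2 points before it
trailing = s.rolling(3).mean()

# center=True: the window is placed symmetrically around the current point
centered = s.rolling(3, center=True).mean()

print(trailing.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
print(centered.tolist())  # [nan, 2.0, 3.0, 4.0, nan]
```

With a trailing window the smoothed curve lags behind the data, which is worth keeping in mind when reading turning points off the trend plot.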

To analyze seasonality, the trend must be removed first. A "differencing" method was used for this, as shown below. The resulting graph is not conclusive, however: the seasonality of the time series is not clear from it.

In [48]:
# Checking for the existence of seasonality
reviews_time_series.total_reviews.diff().plot(figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Year', fontsize=20);
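
To illustrate what differencing does, the sketch below builds a hypothetical series from a linear trend plus a pattern that repeats every 4 steps. First-order differencing (as used above) removes the linear trend and leaves the step-to-step seasonal changes; differencing at the seasonal lag is shown only as a contrast and is not something the notebook does.

```python
import pandas as pd

# Hypothetical series: linear trend (slope 10) plus a pattern of period 4
trend = [10 * i for i in range(8)]
season = [0, 5, 0, -5] * 2
s = pd.Series([t + x for t, x in zip(trend, season)])

# First difference: the trend becomes a constant, the seasonal changes remain
first_diff = s.diff()

# Difference at lag 4: the repeating pattern cancels, only the trend step remains
seasonal_diff = s.diff(4)

print(first_diff.dropna().tolist())     # [15.0, 5.0, 5.0, 15.0, 15.0, 5.0, 5.0]
print(seasonal_diff.dropna().tolist())  # [40.0, 40.0, 40.0, 40.0]
```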

The box plot below shows that January tends to have more reviews than the other months, which can be explained by the fact that it is summer in Rio de Janeiro.

In [49]:
sns.boxplot(x = 'month', y = 'total_reviews', data = reviews_time_series);

With yearly seasonality, an autocorrelation plot is expected to show a spike at a lag of 12 months, meaning that the time series is correlated with itself shifted by twelve months.

The autocorrelation plot shows that the time series is significantly correlated (at the 95% confidence level) with its values over roughly the previous 24 months. This result is not very clear, and further study is required to draw better conclusions.

In [50]:
pd.plotting.autocorrelation_plot(reviews_time_series['total_reviews']);
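
For reference, this is what the expected spike at lag 12 looks like on an idealized series. The sketch uses a synthetic sine wave with a 12-month period (an assumption for illustration, not the review data) and Series.autocorr to read off individual lags:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a perfect 12-month cycle (10 years)
months = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * months / 12))

# Correlation of the series with itself shifted by 12 and by 6 months
lag12 = series.autocorr(lag=12)
lag6 = series.autocorr(lag=6)

print(round(lag12, 3), round(lag6, 3))
```

A purely seasonal series is strongly positively correlated at lag 12 and strongly negatively correlated at lag 6. A series dominated by a trend instead shows slowly decaying positive correlation over many lags, which may be part of why the plot above is hard to interpret.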

Part IV: How does price change with the number of guests accommodated?

In [51]:
listing['accommodates'].describe()
Out[51]:
count    34747.000000
mean         4.202233
std          2.606418
min          1.000000
25%          2.000000
50%          4.000000
75%          5.000000
max        160.000000
Name: accommodates, dtype: float64

The higher the accommodates value, the fewer samples there are.

In [52]:
listing['accommodates'].hist(bins = 30);

Values of 16 or more are grouped together.

In [53]:
listing.loc[listing['accommodates'] >= 16,'accommodates'] = 16
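
The same capping can be expressed with Series.clip, which returns a new Series instead of modifying the column in place. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical accommodates values, including an extreme one
accommodates = pd.Series([1, 4, 16, 20, 160])

# Every value above 16 is replaced by 16; smaller values are unchanged
capped = accommodates.clip(upper=16)
print(capped.tolist())  # [1, 4, 16, 16, 16]
```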
In [54]:
listing['accommodates'].hist(bins = 30);

As suspected, price increases with the number of guests accommodated, although the average price bin levels off from around 8 guests onwards.

In [55]:
listing.groupby('accommodates').agg({'price_bins': 'mean'})
Out[55]:
price_bins
accommodates
1 1.633371
2 2.687710
3 3.130326
4 4.386003
5 4.964359
6 5.706570
7 5.880141
8 6.402817
9 6.049550
10 6.721845
11 6.065574
12 6.644531
13 6.538462
14 6.642857
15 6.723077
16 6.100877